Georgia Institute of Technology – GeneTracer

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Hanseung Lee, Georgia Institute of Technology, hanseung.lee@gatech.edu [PRIMARY contact]
Jaegul Choo, Georgia Institute of Technology, joyfull@cc.gatech.edu
Carsten Gorg, Georgia Institute of Technology, goerg@cc.gatech.edu
Jaeyeon Kihm, Georgia Institute of Technology, jkihm3@gatech.edu
Zhicheng Liu, Georgia Institute of Technology, zliu6@gatech.edu
Jaeeun Shim, Georgia Institute of Technology, jaeeun.shim@gatech.edu
Haesun Park, Georgia Institute of Technology, hpark@cc.gatech.edu
John Stasko, Georgia Institute of Technology, stasko@cc.gatech.edu

Tool(s):

We developed a system, GeneTracer, to visualize various data related to genetic sequences. It has three views: Gene Sequence view, Disease Characteristic view, and Graph view.

 

Gene Sequence view visualizes the current outbreak sequences and native sequences.

•  Color of each gene base: A (red), T (green), C (purple), G (blue)

•  Heatmap row vector (first row): shows how much the gene bases at a position differ across the categories of a characteristic

•  Heatmap column vector (first column): shows how much each sequence differs from the selected row

•  Interactions: removing or moving columns and rows with mouse click and drag; multi-selection with the Ctrl key
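
The two heatmap vectors can be approximated as follows. This is only a minimal sketch, not GeneTracer's actual code: the function names and toy sequences are hypothetical, the column vector is assumed to be a normalized Hamming distance, and the row vector is simplified to count distinct bases per position without splitting by characteristic category.

```python
def column_heat(sequences, selected):
    """Fraction of positions at which each sequence differs from the selected row."""
    ref = sequences[selected]
    return [sum(a != b for a, b in zip(seq, ref)) / len(ref) for seq in sequences]


def row_heat(sequences):
    """Per position: number of distinct bases observed minus one (0 = conserved)."""
    return [len(set(col)) - 1 for col in zip(*sequences)]


# Toy data (hypothetical): three aligned sequences of length 4.
seqs = ["ATCG", "ATCC", "TTGG"]
print(column_heat(seqs, 0))
print(row_heat(seqs))
```

A larger value would map to a darker cell in the corresponding heatmap strip.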

Disease Characteristic view visualizes each sequence's characteristics using colors. The darker the color, the more severe the value.

•  Color coding: Symptoms (red), Mortality (blue), Complications (green), Resistance (purple), Vulnerability (orange)

•  Interactions: sorting by each characteristic, index, or total weight; multi-selection for reordering or removing sequences in Gene Sequence view

Gene Sequence view and Disease Characteristic view are linked.

•  A sequence selected in one view is also selected in the other.

•  Pressing the "Sync" or "Sync always" button in one view reorders the sequences in the other view to match.

Graph view visualizes the relations among the sequences and also shows the Minimum Spanning Tree (MST).

•  Node: sequence indices and countries of the native sequences

•  Edge weight: Hamming distance

Graph view can also interact with the other views. It is implemented with JUNG, the open-source Java Universal Network/Graph Framework.
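
To illustrate the idea behind the Graph view, the sketch below builds a complete graph weighted by Hamming distance and extracts an MST with Prim's algorithm. It is written in Python for brevity and does not use JUNG; the sequence names and strings are hypothetical toy data, and Prim's algorithm is an assumption, since the text does not say which MST algorithm was used.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs):
    """Prim's algorithm on the complete graph; returns a list of (u, v, weight)."""
    names = list(seqs)
    in_tree = [names[0]]
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that connects the tree to a node not yet in it.
        u, v, w = min(
            ((u, v, hamming(seqs[u], seqs[v]))
             for u in in_tree for v in names if v not in in_tree),
            key=lambda e: e[2],
        )
        in_tree.append(v)
        edges.append((u, v, w))
    return edges


# Toy data (hypothetical names and sequences).
toy = {
    "native_A": "AAAAAA",
    "native_B": "AAATTT",
    "outbreak_1": "AAATTC",
    "outbreak_2": "AAGTTC",
}
mst = minimum_spanning_tree(toy)
for u, v, w in mst:
    print(u, v, w)
```

In this toy graph the outbreak sequences attach to native_B rather than native_A, which is the kind of pattern the MST makes visible.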

 

Video:

 

VAST-MC3.mov

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

Nigeria_B is the origin of the current outbreak. GeneTracer first removes the positions whose gene bases are identical across all sequences, which leaves a much more manageable number of bases. Next, GeneTracer constructs the graph and computes the Minimum Spanning Tree (MST) in the Graph view. In Figure 2, Nigeria_B is the native sequence nearest to the current outbreak sequences, which makes it a candidate. In the Gene Sequence view (Figure 1), we can also interactively reorder rows and columns, dragging them up next to the outbreak sequences to make comparison easier. In addition, a filter operation removes sequences that are clearly dissimilar. By interactively exploring the data in this manner, we found that Nigeria_B was by far the most similar sequence to the current outbreak, which matches the result from the Graph view.
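
The filtering and nearest-native search described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: the function names, strain names, and sequences are hypothetical, and "nearest" is taken to mean the smallest summed Hamming distance to all outbreak sequences.

```python
def variable_positions(seqs):
    """0-based indices of columns in which not all sequences share the same base."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]


def nearest_native(natives, outbreak):
    """Name of the native strain with the smallest total distance to the outbreak."""
    keep = variable_positions(list(natives.values()) + outbreak)

    def reduced(seq):
        return [seq[i] for i in keep]

    def total_distance(name):
        ref = reduced(natives[name])
        return sum(sum(a != b for a, b in zip(ref, reduced(o))) for o in outbreak)

    return min(natives, key=total_distance)


# Toy data (hypothetical): positions 0-2 are identical everywhere and get dropped.
natives = {"native_X": "ACGTAC", "native_Y": "ACGTTT"}
outbreak = ["ACGTTA", "ACGATT"]
print(variable_positions(list(natives.values()) + outbreak))
print(nearest_native(natives, outbreak))
```

Dropping the invariant columns does not change any distance; it only reduces what has to be displayed and compared.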


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

To solve the second problem, we only need to examine the strains identified by sequences 583, 123, and 51. We gathered these three sequences at the top of the Gene Sequence view using drag-and-drop. The heatmap column vector showed that 583 is more similar (lighter color) to 123 than to 51. We also filtered out the columns in which all three sequences have the same gene base, which left only four columns, as shown in Figure 3. Compared with sequence 583, sequence 123 differs in only one gene base (column index 269), whereas sequence 51 differs in three (column indices 494, 842, and 946). We therefore conclude that the patient with the strain identified by sequence 123 likely contracted the illness from Nicolai (sequence 583).
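
The comparison amounts to listing the positions at which each candidate differs from the reference and picking the candidate with the fewest. A minimal sketch with hypothetical toy sequences (the real sequences have 1404 bases and are not reproduced here):

```python
def diff_positions(ref, other):
    """1-based positions at which `other` differs from `ref`."""
    return [i + 1 for i, (a, b) in enumerate(zip(ref, other)) if a != b]


# Toy data (hypothetical): one reference and two candidate sequences.
ref = "ACGTACGT"
candidates = {"patient_1": "ACGTACTT", "patient_2": "TCGAACGA"}

diffs = {name: diff_positions(ref, seq) for name, seq in candidates.items()}
closest = min(diffs, key=lambda name: len(diffs[name]))
print(diffs)
print(closest)
```

Here patient_1 differs at one position and patient_2 at three, mirroring the one-versus-three pattern found for sequences 123 and 51.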


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

 

Answers (see Figures 4 and 5):
1)  A → C, 269
2)  A → T, 946 and T → C, 842
3)  A → G, 223

 

Mutation A → C, 269 occurred only in sequences with severe symptoms (sequences 99, 118, 123, and 997), so it is strongly related to symptom severity.

Mutations A → T, 946 and T → C, 842 occurred mostly in sequences with severe and moderate symptoms. Note that when T → C, 842 occurs without A → T, 946, the sequence shows only mild symptoms (e.g., sequences 49 and 961). Therefore, the two mutations increase symptom severity when they occur together.

For the third mutation, we obtained three candidates: A → G, 223; A → C, 197; and G → C, 212. We selected A → G, 223 because the sequences changed by the last two candidates overlap with those of the two mutations we found first.
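
The reasoning above can be checked by counting, for a single mutation or a combination, how the carrying sequences are distributed over the severity levels. The sketch below uses hypothetical toy records and a hypothetical function name; it is not the tool's code.

```python
from collections import Counter


def severity_counts(records, mutations):
    """Count severity labels among sequences that carry every listed mutation.

    records:   list of (sequence, severity label)
    mutations: list of (1-based position, mutant base)
    """
    counts = Counter()
    for seq, severity in records:
        if all(seq[pos - 1] == base for pos, base in mutations):
            counts[severity] += 1
    return counts


# Toy data (hypothetical): length-4 sequences with a severity label.
records = [
    ("ACGT", "mild"),
    ("ACGA", "mild"),
    ("TCGT", "moderate"),
    ("TCGA", "severe"),
    ("TCGA", "severe"),
]
single = severity_counts(records, [(4, "A")])
joint = severity_counts(records, [(1, "T"), (4, "A")])
print(dict(single))
print(dict(joint))
```

In the toy data the single mutation also appears in a mild case, while the combination appears only in severe cases, which is the same kind of argument made for positions 842 and 946.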


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

 

Answers:
1)  A → T, 946 and T → C, 842
2)  T → C, 790
3)  A → G, 223

 

Unlike the previous problem, here we had to consider not only symptoms but also all the other characteristics. Our approach consists of three steps.

In the first step, we reordered the Gene Sequence view according to each sorted result from the Disease Characteristic view, using both views together. We sorted the sequences by a characteristic and then applied this order to the Gene Sequence view with the "Sync" function. During this process, the tool marks the boundary between levels of the characteristic with a thick blue line in the Gene Sequence view. We then moved the columns with large variance to the left side and reordered the sequences (rows) to place similar gene bases together within the same category of the characteristic. These interactive steps revealed patterns in the gene sequences and potential mutations that are critical for each characteristic. As a result, we found two to five candidate critical mutations per characteristic. Some examples are shown in Figures 7 and 8.

The second step repeats the process of the first step, except that the gene sequences are sorted by total weight. We assigned a weight of one to three to each characteristic value, and the total weight is the sum of all characteristic weights. We reorganized the Gene Sequence view based on the total weight, then focused on the critical mutation candidates from the first step and analyzed them again by moving, removing, or filtering columns and rows.
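
The total weight can be sketched as a plain sum over the five characteristics, each already mapped to a level from 1 to 3. The sequence identifiers and level values below are hypothetical.

```python
CHARACTERISTICS = ("symptoms", "mortality", "complications", "resistance", "vulnerability")


def total_weight(levels):
    """Sum of the per-characteristic weights (each 1, 2, or 3), so 5 to 15 overall."""
    return sum(levels[c] for c in CHARACTERISTICS)


def sort_by_total_weight(table):
    """Sequence ids ordered from most to least dangerous by total weight."""
    return sorted(table, key=lambda sid: total_weight(table[sid]), reverse=True)


# Toy data (hypothetical sequence ids and levels).
table = {
    "s1": dict(zip(CHARACTERISTICS, (3, 3, 2, 3, 2))),
    "s2": dict(zip(CHARACTERISTICS, (1, 1, 1, 2, 1))),
    "s3": dict(zip(CHARACTERISTICS, (2, 3, 3, 3, 3))),
}
order = sort_by_total_weight(table)
print(order, [total_weight(table[s]) for s in order])
```

Summing with equal weights matches the challenge's instruction to treat each characteristic as equally important.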

Finally, the third step speeds up finding and verifying our answer. We used several data mining techniques to choose which gene positions carry significant information. Regression and decision trees provided helpful clues: from the results, we chose the column indices with the highest coefficient values and also examined a few nodes near the root of the decision tree.

We assigned a value to each gene base and treated it as an unordered categorical variable. We also assigned 1 to mild symptoms, 2 to moderate symptoms, and 3 to severe symptoms. Thus, for 58 gene sequences with 1404 gene bases each, we created a 58-by-1404 matrix X of predictor values along with a vector Y consisting of 58 response values. Since a decision tree allows combining gene bases that have similar values with respect to the level of a target value, little information is lost in collapsing gene bases together, which improves the classification result.

Linear regression was also used to explore initial column indices (gene base positions) that could be potential answers. We formed a matrix consisting of tuples of pairs of gene sequences and the positions where base substitutions occurred. Each data item is a pair of indices, and we tried to find the pair with the highest influence on the total weight of characteristics. We fit first-order linear regression, second-order linear regression, and the two combined. We then selected the top few coefficients with their corresponding column indices and started the visual analysis with our tool. Although these techniques did not directly give the answers we expected, starting from some of the suggested column indices made our visualization tool more effective. The quantitative results also let us verify the answers we reached through qualitative decision making.
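
As a rough illustration of why informative positions end up near the root of a decision tree, the sketch below ranks positions by information gain, treating each base as an unordered category. Information gain is an assumption here: the text does not state which splitting criterion or software was used, and the data below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction from splitting the labels by the base in one column."""
    n = len(labels)
    groups = {}
    for base, label in zip(column, labels):
        groups.setdefault(base, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_positions(seqs, labels):
    """(gain, 1-based position) pairs, best first; ties broken by position."""
    gains = [(information_gain(col, labels), i + 1) for i, col in enumerate(zip(*seqs))]
    return sorted(gains, key=lambda g: (-g[0], g[1]))


# Toy data (hypothetical): 4 sequences, labels 1 = mild, 3 = severe.
seqs = ["AAC", "AAT", "CAC", "CAT"]
labels = [1, 1, 3, 3]
ranked = rank_positions(seqs, labels)
print(ranked)
```

Position 1 separates the two label groups perfectly and ranks first, while the conserved and uninformative positions have zero gain.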


The first step yielded the following candidate mutations.

Symptoms: A → T at 946 and T → C at 842 / A → C at 269 / A → G at 223
Mortality: A → T at 946 and T → C at 842 / T → C at 790 / A → C at 269
Complications: T → C at 790 / A → G at 223
Resistance: T → C at 790 / A → T at 946 and T → C at 842 / A → C at 269 / A → G at 223 / A → C at 197
Vulnerability: A → T at 946 and T → C at 842 / G → C at 212

In the second step, we analyzed the Gene Sequence view as shown in Figure 6. Combining it with the candidates from the first step, we determined the top three mutations across all characteristics. In Figure 6, consider the squares that are green at position 946 and purple at position 842. These squares are mostly clustered in the upper part of the view, which means the sequences have large total weights. The purple squares at position 790 are also mostly in the upper part of the view. Finally, consider the blue squares at position 223: every sequence with the mutation at position 223 has a total weight of at least 11, which indicates that this mutation is very dangerous. The visualization thus makes it easy to check and verify the answers, and gathering the same colors together makes the patterns easier for the user to understand.

In addition, the view reveals some other interesting patterns. For example, if a G → C mutation occurs at position 22, the total weight is quite low. This suggests that the mutation leads to a stable strain and could be a potential cure for the current outbreak.

 

In the third step, we verified the answer using results from regression and decision trees. For example, Figure 9 shows a decision tree output in which some nodes near the root correspond to column indices from our answer. This means that, at these column indices, mutated versus non-mutated gene bases best discriminate the categories of the characteristics. This supports our answer.

 

 

 

Figures:

Figure 1: fig1_MC3_1.jpg

Figure 2: fig2_MST.jpg

Figure 3: fig3_MC3_2.jpg

Figure 4: fig4_MC3_3.jpg

Figure 5: fig5_MC3_3.jpg

Figure 6: fig6_MC3_4.jpg

Figure 7: fig7_MC3_4.jpg

Figure 8: fig8_MC3_4.jpg

Figure 9: fig9_decision_tree.jpg
